Abstract

Deep neural networks have achieved excellent accuracy on many visual
pattern classification problems. Many state-of-the-art methods employ a
technique known as data augmentation at the training stage. This paper
addresses the decision rule for classifiers trained with augmented data.
Our method, named APAC (Augmented PAttern Classification), is a way of
classifying with the optimal decision rule for augmented data learning.
Discussion of data augmentation methods is not our primary focus. We
show clear evidence in several experiments that APAC gives far better
generalization performance than the traditional way of class prediction.
Our convolutional neural network model with APAC achieved a
state-of-the-art accuracy on the MNIST dataset among non-ensemble
classifiers. Even our multilayer perceptron model beats some
convolutional models with recently invented stochastic regularization
techniques on the CIFAR-10 dataset.

1. Introduction 

The output of an ideal pattern classifier satisfies two properties. One
is invariance under replacement of a data point by another data point
within the same class; we refer to this as intra-class invariance. The
other is distinctiveness under replacement of a data point in one class
by a data point in another class; we refer to this as inter-class
distinctiveness. Good classifiers have these properties, to a greater or
lesser degree, for data not seen in training.

For a given class, there exists a set of transformations that leave the
class label unchanged. In visual object recognition of an “apple”, for
example, the class label stays the same under different lighting
conditions, backgrounds, and poses, to name a few. One can expect a
classifier to gain good intra-class invariance by learning from a
dataset containing many images with these variations.

A classifier should also show inter-class distinctiveness to distinguish
one class from another. If one constructs a training dataset containing
a green apple class and a red apple class, careful attention must be
paid to the lighting condition, because the important color feature may
be spoiled under some lighting conditions. The appropriate types and
ranges of variations depend on the problem setting.

For an image classifier to gain good intra-class invariance without
compromising inter-class distinctiveness, there are broadly two types of
approaches. One approach is to embed mechanisms in the classifier that
give robustness against intra-class variations. One of the most
successful such classifiers is the Convolutional Neural Network (CNN)
[11]. A CNN has two important building blocks, convolution and spatial
pooling, which give robustness against global and small local shifts,
respectively. These shifts are a typical form of intra-class variation.

Another approach is data augmentation, meaning that a given dataset is
expanded by virtual means. A common way is to deform the original data
in many ways using prior knowledge of intra-class variation. Color
processing and geometrical transformation (rotation, resizing, etc.) are
typical operations used in visual recognition problems. Adding virtual
data points amounts to making the points denser in the manifold that the
class instances form. Strong regularization effects are expected from
augmented data learning.

Augmented data learning is also beneficial from an engineering point of
view. Dataset creation is a painstaking and costly part of product
development. First, data augmentation allows the use of prior knowledge
about recognition targets, which engineers do have in most cases, and
thus provides easy and cheap substitutes for real data. Second, the
quality of virtual data can be easily evaluated by human perception. In
a visual recognition task, one can check by eye whether virtual images
resemble real ones.

Many state-of-the-art methods in generic object recognition use deep
CNNs trained on augmented datasets comprising original and deformed data
(see recent works [22, 20, 9, 6]). It has been pointed out that CNN
models with many layers have great discriminative power; on the other
hand, the theoretical and methodological aspects of data augmentation
are not yet fully understood.

1.1. Related work

Data augmentation plays an essential role in boosting the performance of
generic object recognition. Krizhevsky et al. used a few types of image
processing, such as random cropping, horizontal reflection, and color
processing, to create image patches for ImageNet training [9]. More
recently, Wu et al. vastly expanded the ImageNet dataset with many types
of image processing, including color casting, vignetting, rotation,
aspect ratio change, and lens distortion, on top of standard cropping
and flipping [22]. Although these two works use different network
architectures and computational hardware, it is still interesting to see
the difference in performance levels. The top-5 prediction error rate of
the latter is 5.33%, while that of the former is 16.42%. Such a large
gap could be implicit evidence that richer data augmentation leads to
better generalization.

Paulin et al. proposed a novel method for creating augmented datasets
[15]. It greedily selects the types of transformations that maximize
classification performance. The algorithm requires heavy computational
resources, so the exhaustive pursuit is almost intractable when deep
networks with a huge number of parameters are trained.

Handwritten character/digit recognition has been an important problem
for both industrial applications and algorithm benchmarking for a
quarter century [23, 10, 17, 11, 2, 3]. The problem is relatively simple
in the sense that there is no degree of freedom in the background and
that strokes can be easily modified. Elastic distortion is a commonly
used data augmentation technique that has the good property of giving
many degrees of freedom in the stroke forms while leaving the
topological structure invariant. Indeed, data augmentation by elastic
distortion is crucial in boosting classification performance [17, 2, 3].

In pedestrian detection, the use of synthetic pedestrians in real
backgrounds [14] and of synthetic occlusion [1] has been proposed.
Though these approaches give additional degrees of freedom in expanding
training datasets, we omit such means in this work.

Data augmentation can be categorized into two types: off-line and
on-line. In this work, off-line data augmentation means increasing the
number of data points by a fixed factor before training starts. The same
instances are repeatedly used in the training stage until convergence
[17]. On-line data augmentation means increasing the number of data
points by creating new virtual samples at each iteration of the training
stage (see representative works [3, 2]). There, random deformation
parameters are sampled at each iteration, hence the classifier always
“sees” new samples during training. Cireşan et al. claim that the
on-line scheme greatly improves classification performance because
learning a very large number of samples is likely to avoid over-fitting
[3, 2]. Our work is mostly inspired by theirs, and is focused on on-line
deformation.


Very recently, a website article reported a method named Test-Time
Augmentation [4], in which prediction is made by averaging the outputs
for many virtual samples, though the algorithm is not fully described.

Tangent Prop [16] is a way to avoid over-fitting with implicit use of
data augmentation. Virtual samples are used to compute a regularization
term defined as the sum of tangent distances, each of which is the
distance between an original sample and a slightly deformed one. The
classifier’s output is expected to be stable in the vicinity of the
original data points, but not necessarily so in other locations.

1.2. Contribution 

This paper proposes the optimal decision rule for a given 
data sample using classifiers trained with augmented data. 
We do not discuss methods of data deformation themselves. 
Throughout this paper we assume that training is done with 
data samples deformed in on-line fashion. That is, random 
deformation parameters are sampled at every iteration, and 
a deformed sample is used only once and discarded after a 
single use. Such training minimizes an expectation value 
of loss function over random deformation parameters. We 
claim that class decision must be made so as to minimize 
the same expectation value for a given test sample. 

We show by experiments that the proposed decision rule gives lower
classification error rates than the conventional decision rule. APAC
improves the test error rate of the CNN by 0.16 percentage points for
MNIST and by 9.72 percentage points for CIFAR-10. To the best of our
knowledge, the improved error rate for MNIST is the best among
non-ensemble classifiers reported in the past.

Though we believe that the proposed decision rule is beneficial to any
classification problem in which augmented data learning is applied, this
paper mainly discusses image classification problems because we have not
conducted experiments in other fields.

2. On-line data deformation 

On-line data deformation learning can generate classifiers with strong
intra-class invariance. Such learning generally takes many iterations to
reach a minimum of the objective function. A vast number of training
instances are processed because the number of instances increases
linearly with the number of iterations. In the on-line deformation
scheme, the original data themselves are not trained on explicitly; they
are only trained on probabilistically.

In this section we provide a formal definition of augmented data
learning, which has been treated rather heuristically so far. Let us
first define the data deformation function as
$u : \mathbb{R}^d \to \mathbb{R}^d$, where $d$ is the dimension of the
original data.^1 The function $u(x; \Theta)$ takes a datum
$x \in \mathbb{R}^d$ and deformation-controlling parameters
$\Theta = \{\theta_1, \dots, \theta_K\}$ and returns a virtual sample.
Each element of the set $\Theta$ is defined as a continuous random
variable for convenience. Some are responsible for continuous
deformation; e.g., $\theta_1$ being a scaling factor, $\theta_2$ being a
horizontal shift, etc. The others are responsible for discrete
deformation; e.g., if $\theta_3 \in [0, \tfrac{1}{2})$ the image is
flipped horizontally, and if $\theta_3 \in [\tfrac{1}{2}, 1]$ no
side-flipping is performed, where $\theta_3 \sim \mathcal{U}(0, 1)$. We
use the class label $c$ in the superscript, $\Theta^c$, if deformation
is done in a class-dependent fashion. In this work, it is assumed that
the probability density functions of the deformation parameters are
given at the beginning and held fixed during training and testing. In
the following, we consider two cases: 1) the way of deformation is the
same for all classes, and 2) otherwise. We use the cross entropy as the
loss function, as it is the most widely used for deep learning in the
supervised setting. The cross entropy requires vector normalization in
the output units, for which we use the softmax function.

^1 The data deformation function can be generalized to
$u : \mathbb{R}^{d_0} \to \mathbb{R}^{d_1}$ with $d_0 \neq d_1$, but we
consider the $d_0 = d_1$ case in this study for simplicity.
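
As a minimal illustrative sketch of the deformation function $u(x; \Theta)$ defined above, the following NumPy code draws one parameter set and applies it to a 2-D image. The names `sample_theta` and `deform`, the scaling and shift ranges, and the nearest-neighbour resampling are assumptions made for illustration only; only the flip rule ($\theta_3 \sim \mathcal{U}(0, 1)$, flip when $\theta_3 < 1/2$) is taken from the text.

```python
import numpy as np

def sample_theta(rng):
    """Draw one parameter set Theta = {theta_1, theta_2, theta_3} (ranges are assumed)."""
    return {
        "scale": rng.uniform(0.9, 1.1),   # theta_1: scaling factor (assumed range)
        "shift": rng.uniform(-2.0, 2.0),  # theta_2: horizontal shift in pixels (assumed range)
        "flip": rng.uniform(0.0, 1.0),    # theta_3 ~ U(0, 1), as in the text
    }

def deform(x, theta):
    """u(x; Theta): map an H x W image to a virtual sample of the same shape."""
    h, w = x.shape
    # Scale about the image centre using nearest-neighbour resampling.
    rows = np.clip(np.rint((np.arange(h) - h / 2) / theta["scale"] + h / 2), 0, h - 1).astype(int)
    cols = np.clip(np.rint((np.arange(w) - w / 2) / theta["scale"] + w / 2), 0, w - 1).astype(int)
    out = x[np.ix_(rows, cols)]
    # Horizontal shift, rounded to whole pixels (wraps around, for brevity).
    out = np.roll(out, int(np.rint(theta["shift"])), axis=1)
    # Discrete deformation driven by a continuous variable: flip iff theta_3 is in [0, 1/2).
    if theta["flip"] < 0.5:
        out = out[:, ::-1]
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))
virtual = deform(image, sample_theta(rng))
```

Because every parameter is a continuous random variable, continuous and discrete deformations can be sampled in one uniform way, which is the convenience the definition above refers to.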

2.1. Class-indistinctive deformation learning 

We first discuss case 1). Let $i \in \{1, \dots, N\}$ denote the index
of an original training datum, $c_i \in \{1, \dots, N_c\}$ denote the
class index of the $i$-th sample, $W$ denote the set of all parameters
to be optimized, and $f(\,\cdot\,; W) : \mathbb{R}^d \to \mathbb{R}^{N_c}$
denote the function realized by a neural network with softmax output
units. Let $f_c$ be the $c$-th component of the output; then
$\sum_c f_c = 1$ and $f_c > 0$, $\forall c \in \{1, \dots, N_c\}$.

In the following, regularization terms are ignored for 
simplicity. 

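Class-indistinctive deformation learning:

Given $\mathcal{D} = \{(x_i, c_i)\}$, $i = 1, \dots, N$, find $W^*$ such that

$$W^* = \arg\min_{W} J_{\mathcal{D}}(W), \tag{1}$$

where the objective function $J_{\mathcal{D}}(W)$ is defined as

$$J_{\mathcal{D}}(W) = \sum_{i=1}^{N} \mathbb{E}_{\Theta}\left[ -\ln f_{c_i}\bigl(u(x_i; \Theta); W\bigr) \right]. \tag{2}$$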

The expectation value is computed by marginalizing the cross entropy
over the deformation parameters, which independently obey unconditional
probability densities $p_k(\theta_k)$, $k = 1, \dots, K$. By using
appropriate random number generators, one can generate countlessly many
virtual samples during training. When the objective function is
sufficiently reduced, the classifier outputs a value close to the target
value for an arbitrarily deformed training image. That is, the
classifier gains a high level of intra-class invariance with respect to
the set of deformations applied, without compromising inter-class
distinctiveness.

Deformation must be meaningful for all classes in class-indistinctive
deformation learning. Examples include homography transformation and
global color processing such as gamma correction.

A truly intra-class-invariant classifier would be obtained if the
integrals
$\mathbb{E}_{\Theta}[\,\cdot\,] = \int \cdots \int \prod_k d\theta_k\, p_k(\theta_k)\,(\,\cdot\,)$
could be calculated analytically. In reality, however, they are hard to
integrate out, so one needs to convert the integral into a sum of
infinitely many terms:

$$\mathbb{E}_{\Theta}[\,\cdot\,] = \lim_{R \to \infty} \frac{1}{R} \sum_{\Theta = \Theta^{(1)}, \dots, \Theta^{(R)}} (\,\cdot\,). \tag{3}$$

Here, $\Theta^{(r)} = \{\theta_1^{(r)}, \dots, \theta_K^{(r)}\}$ is the
set of deformation parameters at the $r$-th sampling, drawn from the
unconditional probability density functions $p_k(\cdot)$,
$k = 1, \dots, K$. With this summation form, the objective function can
be approximately minimized by the widely used mini-batch Stochastic
Gradient Descent (SGD). Note that a batch optimization algorithm is no
longer applicable in a strict sense because the number of terms is
infinite. At each iteration of the optimization process, data indices
and deformation parameters are randomly sampled to generate a
mini-batch. The mini-batch is discarded after a single use. The total
number of terms in the objective function is determined when training is
terminated. It is clear from Eq. (2) that the original data samples are
not explicitly fed into the network.
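
The on-line sampling procedure described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `make_minibatch` and the toy `noisy` deformation are hypothetical names, and additive noise merely stands in for the image deformations used in the paper.

```python
import numpy as np

def make_minibatch(images, labels, deform, rng, batch_size=100):
    """Build one mini-batch of virtual samples; it is meant to be used once and discarded."""
    # Data indices are sampled with replacement, so an index may repeat within a batch.
    idx = rng.integers(0, len(images), size=batch_size)
    # Each drawn sample is deformed independently with freshly sampled parameters.
    batch = np.stack([deform(images[i], rng) for i in idx])
    return batch, labels[idx]

def noisy(x, rng):
    """Placeholder deformation (an assumption): additive Gaussian noise."""
    return x + rng.normal(0.0, 0.1, size=x.shape)

rng = np.random.default_rng(0)
images = rng.random((5, 8, 8))   # five toy "original" images
labels = np.arange(5)
batch, targets = make_minibatch(images, labels, noisy, rng)
```

Each call corresponds to one set of terms in the sum of Eq. (3); the undeformed originals never enter a batch.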

We believe that the original data should not be used for validation,
contrary to the claim by Cireşan et al. [3, 2] that they can be. The
original and deformed data are strongly correlated in the feature space,
especially when the deformation is moderate. Therefore, we advise
against using the original training data to estimate generalization
performance.

In the experiments, we employ class-indistinctive deformation learning.

2.2. Class-distinctive deformation learning 

Next, we discuss case 2), where the probability densities for the
deformation parameters depend on the class. Although such a scheme
requires one to design the deformation in a class-specific way, it is
likely to give stronger inter-class distinctiveness to the classifier.
For example, it is probably not a good idea to cast a color with a
strong red component onto an image belonging to the “green apple” class
when there is a “red apple” class, for an obvious reason. But casting
red color onto an image belonging to, say, the “grape” class may be
reasonable. In the handwritten digit classification problem, Cireşan et
al. used different ranges of deformation parameters for certain classes
[2]: the rotation and shearing applied to digits 1 and 7 are weaker than
those applied to other digits. Their work is another example of
class-distinctive deformation learning.


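Class-distinctive deformation learning:

Given $\mathcal{D} = \{(x_i, c_i)\}$, $i = 1, \dots, N$, find $W^*$ such that

$$W^* = \arg\min_{W} J_{\mathcal{D}}(W), \tag{4}$$

where the objective function $J_{\mathcal{D}}(W)$ is defined as

$$J_{\mathcal{D}}(W) = \sum_{i=1}^{N} \mathbb{E}_{\Theta}\left[ -\ln f_{c_i}\bigl(u(x_i; \Theta); W\bigr) \,\middle|\, c_i \right]. \tag{5}$$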


For an arbitrary $i$-th empirical sample, the expectation value is
computed by marginalizing over the deformation parameters:
$\mathbb{E}_{\Theta}[\,\cdot \mid c_i] = \int \cdots \int \prod_k d\theta_k\, p_k(\theta_k \mid c_i)\,(\,\cdot\,)$.
Here, the $k$-th deformation parameter obeys a class-conditional
probability density $p_k(\theta_k \mid c_i)$.^2

The optimization procedure is similar to that of class-indistinctive
deformation learning, except that the deformation parameters obey
conditional probabilities. The integral can be rewritten as a sum of an
infinite number of terms:

$$\mathbb{E}_{\Theta}[\,\cdot \mid c] = \lim_{R \to \infty} \frac{1}{R} \sum_{\Theta = \Theta^{c(1)}, \dots, \Theta^{c(R)}} (\,\cdot\,), \tag{6}$$

where $\Theta^{c(r)} = \{\theta_1^{c(r)}, \dots, \theta_K^{c(r)}\}$ is
the set of deformation parameters at the $r$-th sampling, drawn from the
PDFs $p_k(\theta_k \mid c)$, $k = 1, \dots, K$, $c = 1, \dots, N_c$.
Some form of SGD can be used to minimize the objective function with a
finite-term approximation.


3. Decision rule for augmented data learning 

In this section we propose a new way of classification, 
APAC: Augmented PAttern Classification, and claim 
that it gives the optimal class decision for augmented data 
learning described in the previous section. It is shown 
that a single feedforward of a given test sample is no 
longer optimal when one minimizes the expectation value 
at the training stage. Cross entropy loss with softmax 
normalization is assumed in the following discussion. 

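APAC for class-indistinctive deformation learning:

Given parameters $W$ and data $x$, find $c^*$ such that

$$c^* = \arg\min_{c \in \{1, \dots, N_c\}} J_{\{(x, c)\}}(W). \tag{7}$$

APAC for class-distinctive deformation learning:

Given parameters $W$ and data $x$, find $c^*$ such that

$$c^* = \arg\min_{c \in \{1, \dots, N_c\}} J_{\{(x, c)\}}(W). \tag{8}$$

Here $J_{\{(x, c)\}}$ is the objective of Eq. (2) in Eq. (7) and that of
Eq. (5) in Eq. (8), evaluated on the single pair $(x, c)$.

^2 There may be a case where certain types of deformation are only
applied to selected class(es). In such a case, a delta function is used
as the PDF to “turn off” the deformation for other classes; i.e.,
$p_k(\theta_k \mid c) = \delta(\theta_k)$.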



Figure 1. APAC, the proposed way of classification (above). Non-APAC,
the conventional way of classification (below).


It is clear from Eqs. (7) and (8) that class decision making is an
optimization process requiring minimization of the expectation values.
The expectation value for a given data sample must be computed at the
test stage, since it is what is minimized (with some approximation)
during the training stage. Note that the test sample itself is not fed
into the classifier. In practice, a finite-term relaxation must be made
at the test stage to estimate the expectation value:


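$$\mathbb{E}_{\Theta}[\,\cdot\,] \approx \frac{1}{M} \sum_{\Theta = \Theta^{(1)}, \dots, \Theta^{(M)}} (\,\cdot\,), \tag{9}$$

$$\mathbb{E}_{\Theta}[\,\cdot \mid c] \approx \frac{1}{M} \sum_{\Theta = \Theta^{c(1)}, \dots, \Theta^{c(M)}} (\,\cdot\,), \tag{10}$$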


for the class-indistinctive case and the class-distinctive case,
respectively. This means that a finite number of sets of deformation
parameters must be randomly sampled using the same probability density
functions as in training. APAC requires averaging the logarithms of the
softmax outputs and then taking the class with the maximum average as
the optimal prediction. The process flow is depicted in Fig. 1. We
emphasize that taking the logarithm is an important step; otherwise an
irrelevant quantity is minimized at the test stage and classification
performance is likely to degrade. APAC is equivalent to picking the
class that maximizes the product of the softmax outputs, which is
analogous to selecting the largest joint probability among the
individual class probabilities of many virtual instances. For a
sufficiently trained classifier, generalization performance is expected
to approach its best asymptotically as the number of terms, M,
increases.
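
The class-indistinctive decision rule can be sketched in a few lines of NumPy. The code below is an illustration under stated assumptions, not the authors' code: `apac_predict`, `non_apac_predict`, `logits_fn`, and the additive-noise deformation are hypothetical, and a random linear map stands in for a trained network.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log of the softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def apac_predict(x, logits_fn, deform, rng, M=64):
    """Average the log-softmax over M virtual samples, then take the argmax."""
    # Only deformed copies are fed to the network, never the test sample itself.
    virtual = np.stack([deform(x, rng) for _ in range(M)])
    mean_log_prob = log_softmax(logits_fn(virtual)).mean(axis=0)
    # Maximising the mean log-probability minimises the estimated expected loss (Eq. (9)).
    return int(np.argmax(mean_log_prob))

def non_apac_predict(x, logits_fn):
    """Conventional rule: a single feedforward pass of the undeformed test sample."""
    return int(np.argmax(logits_fn(x[None])[0]))

# Toy setup (assumption): a random linear "network" and additive-noise deformation.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
logits_fn = lambda X: X @ W
noisy = lambda x, rng: x + rng.normal(0.0, 0.5, size=x.shape)
x = rng.normal(size=5)
prediction = apac_predict(x, logits_fn, noisy, rng, M=256)
```

Since the logarithm turns a product into a sum, the same class is obtained by taking the argmax of the product of the M softmax vectors; working in the log domain simply avoids numerical underflow for large M.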

The decision rule for class-distinctive deformation learning requires
generating multiple sets of virtual samples for a given test image.
Suppose one uses $N_d$ sets of deformations^3 at the training stage;
then at the testing stage a data sample has to be deformed in $N_d M$
different ways. The average of $M$ logarithms of the softmax output is
then computed for each class, using the corresponding deformation type,
and the class with the maximum average is picked as the prediction.

^3 $N_d = N_c$ when each class has a unique deformation set, and
$N_d < N_c$ when two or more classes share the same type of deformations.
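
A hedged sketch of the class-distinctive rule follows. The function name, the `deform_sets` list, and the `set_of_class` lookup are hypothetical constructs introduced for illustration; the toy example uses two deformation sets shared among three classes (so $N_d < N_c$).

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log of the softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def apac_predict_class_distinctive(x, logits_fn, deform_sets, set_of_class, rng, M=64):
    """deform_sets: N_d deformation functions; set_of_class[c]: index of the set used by class c."""
    mean_logp = []
    # Deform the test sample M times with each of the N_d sets (N_d * M virtual samples in total).
    for deform in deform_sets:
        virtual = np.stack([deform(x, rng) for _ in range(M)])
        mean_logp.append(log_softmax(logits_fn(virtual)).mean(axis=0))
    # For class c, read the averaged log-probability of c under the set assigned to class c.
    scores = np.array([mean_logp[set_of_class[c]][c] for c in range(len(set_of_class))])
    return int(np.argmax(scores))

# Toy setup (assumption): linear "network", weak and strong additive-noise deformations.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
logits_fn = lambda X: X @ W
weak = lambda x, rng: x + rng.normal(0.0, 0.1, size=x.shape)
strong = lambda x, rng: x + rng.normal(0.0, 0.5, size=x.shape)
x = rng.normal(size=5)
prediction = apac_predict_class_distinctive(x, logits_fn, [weak, strong], [0, 1, 1], rng)
```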



































4. Experiments 

Experiments on image classification are carried out to 
evaluate generalization abilities of APAC. 

4.1. Datasets 

Two datasets are used in the experiments. 

MNIST [10]. This dataset contains images of handwritten digits with
ground truths. It has 60K training and 10K testing samples. There are
ten types of digits (0-9). The images are gray-scale, 28 x 28 pixels in
size. The background has no texture.

CIFAR-10 [8]. This dataset is for benchmarking coarse-grained generic
object classification. It has 50K training and 10K testing samples. The
labels are: plane, car, bird, cat, deer, dog, frog, horse, ship, and
truck. The images are in color, 32 x 32 pixels in size. Foreground
objects appear in different poses. The background differs in each image.

4.2. Image deformation 

Class-indistinctive deformation learning is carried out in 
all experiments. Details of deformation are given below. 
Some processed images are shown in Fig. 2. 

Deformation on MNIST. We employed random (1) homography transformation,
(2) elastic distortion, and (3) line thickening/thinning.

(1) Homography transformation. The image is projectively transformed by
a homography matrix $H$. The eight elements are assigned as Gaussian
random variables: $H_{11}, H_{22} \sim \mathcal{N}(1, 0.1^2)$ and
$H_{12}, H_{13}, H_{21}, H_{23}, H_{31}, H_{32} \sim \mathcal{N}(0, 0.1^2)$.

(2) Elastic distortion. We followed Simard et al. [17], except for the
parameter setting. We used a standard deviation of 6.0 for the Gaussian
filter, and 38.0 for $\alpha$, the enlargement factor applied to the
displacement fields.

(3) Line thickening/thinning. Morphological image dilation or erosion is
applied to the interpolated images, with probabilities […] and […],
respectively. No line thickening/thinning is done with probability […].

Deformation on CIFAR-10. We used ZCA-whitening [8] followed by random
(1) scaling, (2) shifting, (3) elastic distortion, and side-flipping
with probability of […].

(1) Scaling. The image is magnified by a factor $s$, randomly picked
from the continuous uniform distribution $\mathcal{U}(1.0, 2.0)$. Here,
$1/s$ is the step size of image interpolation.

(2) Shifting. Random cropping is applied in the following fashion. The
x-component of the top-left corner of the interpolated patch is
determined by sampling a value from $\mathcal{U}(0, S_x(1 - 1/s))$,
where $S_x$ is the horizontal size of the original image. The shift
along the y-axis is determined in the same way, but the sampling is done
independently.

(3) Elastic distortion. We used a standard deviation of 8.0 for the
Gaussian filter, and 40.0 for $\alpha$.
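
The scaling and shifting steps (1) and (2) for CIFAR-10 can be sketched together as a random crop that is resampled back to the original size. This is a minimal sketch under assumptions: the function name `scale_and_shift` is hypothetical, and nearest-neighbour interpolation is used only because the interpolation method is not stated in the text.

```python
import numpy as np

def scale_and_shift(img, rng):
    """Magnify by s ~ U(1, 2) and crop at a random offset; output has the input's shape."""
    S_y, S_x = img.shape[:2]
    s = rng.uniform(1.0, 2.0)                      # magnification factor
    # Top-left corner of the source patch, sampled independently along each axis.
    x0 = rng.uniform(0.0, S_x * (1.0 - 1.0 / s))
    y0 = rng.uniform(0.0, S_y * (1.0 - 1.0 / s))
    # Walk over the source image with step 1/s (nearest-neighbour lookup, an assumption).
    cols = np.minimum((x0 + np.arange(S_x) / s).astype(int), S_x - 1)
    rows = np.minimum((y0 + np.arange(S_y) / s).astype(int), S_y - 1)
    return img[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
img = np.arange(32 * 32 * 3).reshape(32, 32, 3)   # toy stand-in for a 32 x 32 color image
out = scale_and_shift(img, rng)
```

With this range for the corner, the sampled patch of width $S_x / s$ always lies inside the original image, so no padding is needed.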

A few comments on applying elastic distortion to CIFAR-10 are in order.
Elastic distortion could be harmful for images of rigid objects such as
planes or cars, but could be beneficial for images of flexible objects
such as cats or dogs. Nevertheless, we applied elastic distortion to all
classes in the same random manner, based on two considerations: 1) a
class is not likely to be altered by elastic distortion even if the
resultant image looks somewhat unnatural, and 2) breaking spatial
correlation helps avoid over-fitting. It is not our intention to state
that elastic distortion is particularly important for generic object
classification.

Figure 2. Visualization of image deformation: (a) MNIST, (b) CIFAR-10.
The ZCA-whitening part is skipped for visibility.

4.3. Network architectures 

We evaluated a CNN and an MLP on each of the datasets. In all networks,
the ReLU activation function [9] was used. We trained and evaluated a
single model in each experiment; i.e., no ensembles of classifiers were
used. We did not impose any stochasticity on the networks during
training, such as dropout [7] or DropConnect [21]. The network
architectures used in our experiments are presented in Table 1.

CNN models. For MNIST, we used the same numbers of layers and maps in
each layer as in [3], but we used 5 x 5 convolutional kernels in all
convolutional layers (denoted by the symbol $C_5$), whereas different
sizes were used in [3]. For CIFAR-10, we simply set the architecture by
hand without any validation. Non-overlapping max pooling with a
$g \times g$ grid (denoted by $P_g$) follows each convolution and
activation. We use the symbols F and S for fully-connected and softmax,
respectively, in the tables.

MLP models. The numbers of layers were determined by validation for both
datasets; 3 weight layers turned out to be the best for both. The
numbers of hidden units for MNIST, 2500 and 2000, are the same as in
[2]. The numbers of units for CIFAR-10 were set by hand without any
validation. Softmax normalization is applied to the output units.
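
The map sizes of the CNN models listed in Table 1 follow from simple arithmetic on the operations $C_k$ and $P_g$. The sketch below assumes unpadded, stride-1 convolutions, which is not stated in the text but is consistent with the 28 to 24 reduction under $C_5$; the helper name `map_sizes` is hypothetical.

```python
def map_sizes(input_size, ops):
    """Return the side length of the feature maps after each operation."""
    sizes = [input_size]
    for kind, k in ops:
        if kind == "C":
            # k x k convolution without padding, stride 1 (assumption).
            sizes.append(sizes[-1] - k + 1)
        elif kind == "P":
            # Non-overlapping k x k max pooling.
            sizes.append(sizes[-1] // k)
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return sizes

mnist_cnn = map_sizes(28, [("C", 5), ("P", 2), ("C", 5), ("P", 2)])
cifar_cnn = map_sizes(32, [("C", 3), ("P", 3), ("C", 3), ("P", 2), ("C", 3), ("P", 2)])
print(mnist_cnn)  # [28, 24, 12, 8, 4]
print(cifar_cnn)  # [32, 30, 10, 8, 4, 2, 1]
```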

4.4. Training details 

Mini-batch SGD with momentum was used in every experiment. The initial
learning rates are 2^-[…] for MNIST-CNN, 2^-[…] for MNIST-MLP, 2^-[…]
for CIFAR-10-CNN, and 2^-[…] for CIFAR-10-MLP. The learning rate is
multiplied by 0.9993 after each epoch.^4 The momentum rate is fixed at
0.9 during training. The mini-batch size is 100. Training data are
randomly sampled with replacement, meaning that the same empirical
sample can be drawn more than once in the same mini-batch, but
deformation is done independently. Training is terminated at 15K epochs.
We confirmed through validation that 15K epochs gave sufficient
convergence. We added L2 regularization terms with a factor of 5e-6 to
the MNIST-MLP cost function and with a factor of 5e-7 to all the other
cost functions.

Table 1. The network architectures. Top: MNIST-CNN model. Middle:
CIFAR-10-CNN model. Bottom: MLP models.

layer       0      1      2      3      4      5      6
# maps      1      20     20     40     40     150    10
map size    28^2   24^2   12^2   8^2    4^2    1^2    1^2
operation   C5     P2     C5     P2     F      F      S

layer       0      1      2      3      4      5      6      7      8
# maps      3      64     64     128    128    256    256    128    10
map size    32^2   30^2   10^2   8^2    4^2    2^2    1^2    1^2    1^2
operation   C3     P3     C3     P2     C3     P2     F      F      S

layer       0      1      2      3
MNIST       784    2500   2000   10
CIFAR-10    3072   4096   3072   10
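
To give a feel for the scale of this training schedule, the short calculation below works out the total learning-rate decay and the number of mini-batch iterations. It uses only figures stated in the text (decay 0.9993 per epoch, 15K epochs, mini-batch size 100, and an epoch defined as the iterations needed to process N virtual samples); the variable names are illustrative.

```python
decay_per_epoch = 0.9993
epochs = 15_000
batch_size = 100

# Ratio of the final learning rate to the initial one after all epochs.
final_lr_factor = decay_per_epoch ** epochs

def iterations_per_epoch(n_train):
    """One epoch processes n_train virtual samples in mini-batches of batch_size."""
    return n_train // batch_size

mnist_iterations = iterations_per_epoch(60_000) * epochs   # 600 iterations per epoch
cifar_iterations = iterations_per_epoch(50_000) * epochs   # 500 iterations per epoch

print(f"final/initial learning rate: {final_lr_factor:.2e}")
print(f"MNIST iterations: {mnist_iterations:,}, CIFAR-10 iterations: {cifar_iterations:,}")
```

The learning rate therefore shrinks to roughly 2.7e-5 of its initial value, and the networks see 900 million (MNIST) and 750 million (CIFAR-10) virtual samples over the course of training.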

4.5. Classification performance 

We first compare classification accuracies between the different
decision rules. Table 2 shows the test error rates produced by APAC and
non-APAC. In the experiments, M, the number of virtual samples created
from a given image at the testing stage (see Eq. (9)), is varied from
1 (= 4^0) to 16,384 (= 4^7). Our claim is that M should be as large as
possible when giving a class prediction, so the APAC results shown in
Table 2 are those with M = 16,384. In all experiments, APAC consistently
gives superior accuracy compared to non-APAC (prediction made by
feedforwarding the original test samples), even though both use the same
weights trained with augmented data.

Table 2. Summary of test error rates produced by our experiments.
Finite-term approximation with M = 16,384 is taken in the APAC results.
Non-APAC means the conventional way of prediction, in which each
original test sample is fed into the network.

Trained on         augmented data          original data
Tested by          APAC       non-APAC     non-APAC
MNIST     CNN      0.23%      0.39%        0.69%
          MLP      0.26%      0.29%        1.49%
CIFAR-10  CNN      10.33%     20.05%       22.63%
          MLP      14.07%     23.20%       55.96%

^4 Here, an “epoch” equals the number of iterations needed to process N
virtual samples, where N is the number of original training data.

Figure 3. Test error rates of our CNN and MLP models on MNIST.
Classification performance of APAC is plotted as a function of M, the
number of virtual samples created at test time (M = 1, 4, 16, ...,
16384). Non-APAC (prediction made by a single feedforward pass of an
original test sample) results are also shown in the figure as text. In
both cases, the same weights are used.

Figure 4. All MNIST test samples misclassified by our CNN model. In each
panel, the ground truth is printed at the top-left corner. The bar plot
in each panel indicates the softmax output of the 1st and 2nd
predictions.

4.5.1 Performance on MNIST 

We evaluate how classification accuracy changes as M becomes large. A
plot of the test error rate versus M is shown in Fig. 3. For both
networks, classification accuracy clearly tends to rise as M increases.
This is because the expected loss is better estimated as M gets larger.

Our CNN model achieved a 0.23% test error rate. To the best of our
knowledge, this test error is the best when a single model is evaluated.
We used no ensemble techniques such as model averaging or voting. (The
best test error rate, 0.21%, was achieved by Wan et al. [21], where
voting of five models was used.) Training was done only once in each of
our experiments. All misclassified test samples are shown in Fig. 4. The
top-2 prediction error rate is as low as 0.01%; i.e., with our CNN model
only one of the 10K test samples is misclassified within two guesses.

Figure 5. Test error rates of our CNN and MLP models on CIFAR-10,
plotted against M (# virtual samples at test time; M = 1, 4, 16, ...,
16384). See Fig. 3 for detail.

Our single MLP model achieved a 0.26% test error rate. To the best of
our knowledge, this is the best result among MLP models reported
previously. The best MLP error rate reported in the past was 0.35%,
achieved by Cireşan et al. [2]. They used a single MLP model with around
12.1M free parameters and 5 weight layers, whereas our MLP model has
around 7.0M parameters and 3 weight layers. Though our model is smaller
in both the number of free parameters and the network depth, it reaches
significantly better classification performance. Our MLP model again has
a 0.01% top-2 prediction error rate on the test dataset; i.e., only one
sample is misclassified within two guesses. Interestingly, the very same
test sample (shown at the top-left position in Fig. 4) is misclassified
by both our CNN and MLP models, and all other 9,999 samples are
correctly classified within two guesses.
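
The "around 7.0M parameters" figure can be checked from the MLP layer sizes given in Table 1 (784, 2500, 2000, 10). The helper below is an illustrative sketch; counting one bias per non-input unit is an assumption, since the text does not say whether biases are included.

```python
def mlp_param_count(layer_sizes, biases=True):
    """Count the weights (and optionally biases) of a fully-connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out      # weight matrix between consecutive layers
        if biases:
            total += n_out         # one bias per output unit (assumption)
    return total

mnist_mlp = mlp_param_count([784, 2500, 2000, 10])
cifar_mlp = mlp_param_count([3072, 4096, 3072, 10])
print(mnist_mlp)  # 6984510
print(cifar_mlp)  # 25203722
```

With or without biases, the MNIST model has about 6.98M parameters, in line with the figure quoted above.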

4.5.2 Performance on CIFAR-10 

A plot of the test error rate versus M, the number of virtual samples
generated at the testing stage, is given in Fig. 5. As on MNIST,
generalization performance tends to rise as the number of virtual
samples increases. The generalization of non-APAC is significantly
inferior to that of APAC for both architectures.

Our single CNN model yields a 10.33% test error rate. This error rate is
better than those of the multi-column CNN (11.21%) [3] and the deep CNN
reported by Krizhevsky et al. (11%) [9], and worse than those of the
Bayesian optimization method (9.5%) [18], Probabilistic Maxout (9.39%)
[19], Maxout (9.35%) [5], DropConnect (9.32%) [21], Network-in-Network
(8.8%) [13], and Deeply-Supervised Nets (8.22%) [12]. Our result is not
close to those of the state-of-the-art methods. However, we believe that
APAC can improve the generalization abilities of even these
high-performing methods if augmented data learning is adopted.

Our single MLP model yields a 14.07% test error rate. This error rate is
worse than that of the multi-column CNN (11.21%) [3], but better than
those of the CNN with the stochastic pooling method (15.13%) [24] and
the CNN with dropout in the final hidden units (15.6%) [7]. We are aware
that fully-connected neural networks over-fit easily when used for image
classification tasks. Still, this experiment gives evidence that a
fully-connected network trained with augmented data and tested with APAC
can outperform CNNs trained with recently invented regularization
techniques and without augmented data [24, 7].

Figure 6. Illustration of APAC prediction of a class-marginal sample.
The violet and light blue points correspond to the class-9 and class-5
test data points, respectively, of MNIST. The red points correspond to
the virtual data points created from a particular test sample. See the
text for more details.

4.6. Analysis 

Figure 7. Visualization of randomly selected weight maps in the 1st weight layers of the MLP models trained with: (a) augmented MNIST,
(b) non-augmented MNIST, (c) augmented CIFAR-10, and (d) non-augmented CIFAR-10.

In all the experiments we conducted with augmented data
learning, APAC consistently gave a better test error rate
than non-APAC, i.e., class prediction through a single
feedforward pass of the original (non-deformed) data. Let us
illustrate how the class prediction changes between the two
decision rules in the case of MNIST. Figure 6 (a) shows a
scatter plot of the class-5 and class-9 test data points in a
2D subspace of the linear output space, with the x- and
y-axes corresponding to the class-5 and class-9 units. There,
the weights are obtained through the class-indistinctive
deformation learning, and the plotted data points do not
involve image deformation. A test sample, whose image is
superposed on the plot, would be misclassified as class 5 by
non-APAC. We deform this test sample in 1,000 different ways
and plot these virtual data points in Fig. 6 (b). One
observation is that the virtual data points lie close to the
original point. This is not surprising, because the original
and virtual images share many features, and the network is
trained to be insensitive to the differences among these
samples, namely weak homography relation, elastic distortion,
and line thickness. The other observation is that the
majority (661 out of 1,000) of the virtual data points are in
favor of the true class ('9'). Indeed, APAC predicts the true
class from the 1,000 virtual samples. The important point is
that taking the product of the softmax outputs of many
virtual samples created from a given test sample gives a
better chance of predicting the correct class than using the
softmax output of the test sample alone.
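
The product-of-softmax decision described above can be
sketched in a few lines of NumPy. This is only a minimal
illustration under stated assumptions, not the implementation
used in the experiments: `logits_fn` and `deform_fn` are
hypothetical stand-ins for the trained network and for the
deformation function used in training, and the toy linear
model and additive noise in the usage lines are invented for
the example.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def apac_predict(logits_fn, deform_fn, x, num_virtual, rng):
    """Predict a class from `num_virtual` deformed copies of `x`.

    logits_fn: hypothetical stand-in for the trained network; maps a
               batch of inputs (M, ...) to pre-softmax outputs (M, K).
    deform_fn: hypothetical stand-in for the training-time deformation;
               maps (x, rng) to one randomly deformed copy of x.
    """
    virtual = np.stack([deform_fn(x, rng) for _ in range(num_virtual)])
    log_probs = log_softmax(logits_fn(virtual))  # shape (M, K)
    # The product of softmax outputs over the virtual samples has the
    # same argmax as the sum of their logarithms; summing logs avoids
    # floating-point underflow when M is large.
    return int(np.argmax(log_probs.sum(axis=0)))


# Toy usage: a random linear "network" and additive-noise "deformation".
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # 4 input features, 3 classes
toy_logits = lambda batch: batch @ W
toy_deform = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)
x = rng.normal(size=4)
pred = apac_predict(toy_logits, toy_deform, x, num_virtual=64, rng=rng)
```

Working in the log domain is a design choice of this sketch:
with M in the thousands, a literal product of probabilities
would quickly underflow to zero for every class.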

One might wonder what happens if the summation, instead of
the product, of the softmax outputs of many virtual samples
is taken at the test stage. For the record, we list the
results below. The test error rates produced by taking the
argmax of the softmax sum with M = 16,384 are: 0.24% for
MNIST-CNN, 0.27% for MNIST-MLP, 10.42% for CIFAR-10-CNN, and
14.01% for CIFAR-10-MLP. The softmax product gives better
performance in all cases except CIFAR-10-MLP. We do not have
a clear explanation for why one of the four experiments shows
the opposite result, but it is safer and more meaningful to
use the softmax product, which maximizes the joint
probability over the individual class probabilities of the
many virtual instances.
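
To make the difference between the two aggregation rules
concrete, the following toy sketch applies both to the same
set of softmax outputs. The probabilities are made up for
illustration and do not come from any of the experiments;
they only show that the two rules can disagree, since the
product rule penalizes a class heavily whenever a single
virtual sample assigns it a very low probability.

```python
import numpy as np


def product_rule(probs):
    """Argmax of the product of per-sample class probabilities.

    Assumes strictly positive probabilities so that the log is finite.
    """
    return int(np.argmax(np.log(probs).sum(axis=0)))


def sum_rule(probs):
    """Argmax of the sum of per-sample class probabilities."""
    return int(np.argmax(probs.sum(axis=0)))


# Made-up softmax outputs for 4 virtual samples and 2 classes.
probs = np.array([
    [0.05, 0.95],
    [0.70, 0.30],
    [0.70, 0.30],
    [0.70, 0.30],
])

print(sum_rule(probs))      # 0: column sums are 2.15 vs 1.85
print(product_rule(probs))  # 1: column products are 0.01715 vs 0.02565
```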

4.7. Some remarks on augmented data learning 

We make some remarks on how augmented data learning makes a
difference in the weights. Figure 7 shows the trained weight
maps in our MLP models. Here, a weight map means a row of the
weight matrix in the 1st weight layer, rearranged into 2D
form to visualize its spatial weighting pattern.

Trained weights for MNIST. The weight maps obtained through
augmented data learning have local-feature sensitive patterns
(see Fig. 7 (a)). It has been argued that local feature
extraction plays an important role in visual recognition.
Combining local features in a certain way gives
discriminative information about the entire object. A CNN is
one particular way to embody such a strategy; an MLP is not,
in the sense that no local-feature extractor is built in.
Nevertheless, it is not impossible to give an MLP the ability
to extract local features, as Fig. 7 (a) indicates. In
contrast, the weight maps obtained through original data
learning have only global patterns (see Fig. 7 (b)), implying
that over-fitting to the training data takes place.
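
The rearrangement of a weight-matrix row into a weight map
can be sketched as follows. This is a minimal illustration,
assuming that each row of the first-layer weight matrix holds
the incoming weights of one hidden unit in row-major (height,
width, channels) order; the layer size and the random weights
are hypothetical, and the storage order of an actual model
may differ.

```python
import numpy as np


def weight_map(first_layer_weights, unit, height, width, channels=1):
    """Rearrange one row of the first-layer weight matrix into image form.

    Assumes each row holds one hidden unit's incoming weights, stored in
    row-major (height, width, channels) order; a real model may differ.
    """
    row = first_layer_weights[unit]
    shape = (height, width) if channels == 1 else (height, width, channels)
    return row.reshape(shape)


def to_display_range(wmap):
    """Linearly rescale a weight map to [0, 1] for plotting."""
    lo, hi = wmap.min(), wmap.max()
    return (wmap - lo) / (hi - lo) if hi > lo else np.zeros_like(wmap)


# Toy usage with random weights: 16 hidden units on 28x28 inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 28 * 28))
wmap = to_display_range(weight_map(W, unit=3, height=28, width=28))
```

For CIFAR-10-sized inputs the same helper would be called
with `height=32, width=32, channels=3`, again under the
storage-order assumption stated above.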


Trained weights for CIFAR-10. The weight maps obtained
through augmented data learning exhibit two functionalities
(see Fig. 7 (c)): a gray-scale local-edge extractor and a
spatially spread color differentiator. Similar findings have
been pointed out by Krizhevsky et al. [9]. The weight maps
obtained through original data learning exhibit no such
functionalities (see Fig. 7 (d)). Lacking spatial structure,
these weights generalize poorly.


5. Conclusion 

This paper addresses the issue of the optimal decision rule
for augmented data learning of neural networks. An online
data deformation scheme in network training leads to
minimization of the loss expectation marginalized over the
deformation-controlling parameters. It is expected that
robustness against intra-class variation can be learned in
this way. Some sort of SGD can reach one of the local minima
of such an objective function through a finite-term
approximation. Our claim is that the class decision must be
made through a similar optimization process; i.e., the
expectation value must be minimized for a given test sample.
This demands that, if analytical integration is not feasible,
a given test sample be augmented using the same deformation
function used in training, in order to compute the loss
expectation for each class.
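
In symbols, the decision rule summarized above can be written
as the following sketch. The notation is assumed here for
illustration and may differ from that of the earlier
sections: f_k denotes the softmax output for class k,
u(x; \theta) the deformation function with parameters
\theta, and \theta_1, ..., \theta_M are M random draws of
those parameters.

```latex
\begin{align}
k^{*}(x)
  &= \arg\min_{k}\;
     \mathbb{E}_{\theta}\!\left[-\log f_{k}\bigl(u(x;\theta)\bigr)\right] \\
  &\approx \arg\min_{k}\;
     -\frac{1}{M}\sum_{m=1}^{M}\log f_{k}\bigl(u(x;\theta_{m})\bigr) \\
  &= \arg\max_{k}\;
     \prod_{m=1}^{M} f_{k}\bigl(u(x;\theta_{m})\bigr).
\end{align}
```

The last step uses only the monotonicity of the logarithm,
which is why minimizing the approximated loss expectation for
each class is equivalent to taking the product of the softmax
outputs over the virtual samples.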

Our experimental results show that the proposed way of
classification, APAC, generalizes far better than the
traditional classification rule, which requires a single
feedforward pass of a given test sample. Our CNN model
achieved the best test error rate (0.23%) among non-ensemble
classifiers on MNIST. Top-2 prediction using the model yields
a test error rate of 0.01%. Through augmented data learning,
MLP models acquire local-feature extraction functionality,
which is key to avoiding over-fitting. Indeed, in the
CIFAR-10 experiment, our MLP model using APAC outperforms
some CNN models trained with recently invented regularization
techniques.